library(tidyverse) # for data cleaning and plotting
library(tidymodels) # for modeling ... tidily
library(lubridate) # for date manipulation
library(openintro) # for the abbr2state() function
library(gplots) # for col2hex() function
library(RColorBrewer) # for color palettes
library(ggthemes) # for more themes (including theme_map())
library(plotly) # for the ggplotly() - basic interactivity
library(gganimate) # for adding animation layers to ggplots
library(transformr) # for "tweening" (gganimate)
library(gifski) # need the library for creating gifs but don't need to load each time
library(shiny) # for creating interactive apps
library(janitor)
library(stacks) # for stacking models
library(glmnet) # for regularized regression, including LASSO
library(ranger) # for random forest model
library(kknn) # for knn model
library(naniar) # for examining missing values (NAs)
library(moderndive) # for King County housing data
library(vip) # for variable importance plots
library(rmarkdown) # for paged tables
library(fastDummies)
library(usemodels) # for suggesting step_XXX() functions
library(readr)
library(kableExtra)
library(DALEX) # model Agnostic Language for exploration and explanation (for model interpretation)
library(DALEXtra)
theme_set(theme_minimal())
Introduction
The financial market is a strange place that is very hard to navigate around. We have seen Warren Buffett, Ray Dalio, Charlie Munger - the very very best who have had billions of dollars. Others, 95% of the population, lose the money instead.
So, how are the best of the best pick its stocks? It is from the fundamental, the technical side or the sentimental side? Within this paper, we hope to bring another perspective, using machine learning models to predict the profitability of the stock price.
Paper Outline
First, we will list and explain the definition of each variable in our dataset. There are 25 variables in total as listed below. The variable we are going to predict is PROFIT. Then, we will start to process our data by merging with the macro data and removing some unwanted variables. After processing the data, we will have a data visualization section that shows the distribution and relationship between some variables. And to find the best model, we choose to explore three machine learning models which are LASSO, Random forest and Stacking models. We use a stacking method to create the stacking model by combining three models (Lasso, Random Forest, and KNN). Lastly, we will make a stock return prediction for 2021 using the trained model with the lowest R-squared and RMSE.
List of Variables
For the dataset, we includes financial information on companies in the S&P 500 stock index from 1999-2021. This information was scraped from Yahoo Finance in November of 2021, and collected in a csv format for data analysis. The information includes metrics like sales, earnings, cogs, stock price, and market sector as well as macroeconomic data such as GDP or Money Supply. The goal is to analyze and model this data to better improve projections for a company’s future profitability. The variables in the data set are described below:
| YEAR |
The financial year of the company |
| COMPANY |
The company’s stock abbreviation symbol |
| MARKET.CAP |
The total market capitalization of the company (Volume * Price) |
| EARNINGS |
The earnings in dollars for the previous year for the given company |
| SALES |
How much the company sold in dollars last year |
| CASH |
How much cash the company has in dollars at the end of the previous year |
| Name |
The full name of the company |
| Sector |
The name of the sector that the company is a part of |
| Earnings_next_year |
The amount of money in dollars that the company earns in the following year |
| PRICE |
The price of the stock when it is bought |
| Sell |
The price of the stock when it is sold |
| VOLUME |
The total number of shares that the company holds |
| COGS |
The total amount the company paid as a cost directly related to the sale of products |
| INVESTMENT |
The total asset or item acquired with the goal of generating income or appreciation |
| RECIEVABLE |
The debts owed to a company by its customers for goods that have been delivered or used but not yet paid for |
| INVENTORY |
How much raw materials used in production as well as the goods produced that are available for sale |
| DEBTS |
How much money the company borrow from other parties |
| CPALTT01USM657N_PC1 |
The percentage change in CPI (measure of inflation) |
| GDP |
The monetary value of all finished goods and services made within a country |
| GDP_PC1 |
The percentage change in GDP |
| T10Y2Y |
Ten year treasury bonds minus two year treasury bonds |
| M1SL |
The total currency and other liquid instruments in a country’s economy |
| M1SL_PC1 |
The percentage change in money supply |
| Earnings_next_year |
How much profit that a company produces next year |
| PROFIT |
How much the money made or lost on an investment |
Loading data
finalDATASET <- read_csv("FINALDATASET.csv")
macro_data <- read_csv("clean_macro - Sheet1.csv")
Data Preprocessing
For the data preprocessing, we combined our data sources and filled missing values for some fundamental factors by using the median of that value in the sector for a certain year. We also created sector dummies in this step. We then dropped some variables that are not in our interests and the splited the testing and trainning data.
final_data <- finalDATASET %>%
# make Sector dummy variables
dummy_cols(select_columns = "Sector") %>%
# merge with updated macro data
select(-CPALTT01USM657N_PC1,
-GDP,
-GDP_PC1,
-M1SL_PC1 ,
-M1SL,
-PRICE,
-Sell,
-COMPANY) %>%
merge(macro_data) %>%
# convert Macro factors from characters to numeric
mutate(across(c("CPALTT01USM657N_PC1","GDP","GDP_PC1","T10Y2Y","M1SL_PC1","M1SL"),
as.numeric)) %>%
group_by(YEAR,Sector) %>%
# replacing the missing value with median od the industry in that year
mutate(across(c(DEBTS,INVESTMENTS,CASH,VOLUME,EARNINGS,COGS,SALES,RECEIVABLE,INVENTORY),~replace(.,.==0,median(.)))) %>%
# delete less important factors -> Exchange can possibily be deleted
ungroup() %>%
mutate(across(c(!where(is.numeric),-"Name"),as.factor)) %>%
select(-Earnings_next_year,-observation_date) %>%
drop_na()
# filter out the data for 2021
final_data_2021 <- final_data %>%
filter(YEAR == 2021)
final_data <- final_data %>%
filter(YEAR < 2021)
# split the data
set.seed(327) #for reproducibility
data_split <- initial_split(final_data,
prop = .75)
data_training <- training(data_split)
data_testing <- testing(data_split)
# quick look of the data
final_data %>%
head(5)
Data Visualization
We first explored distributions of our predictors and outcome. As the following graph suggested, all the numeric variables are severely right-skewed with some outliers. In our data, though the numbers of firms in each sector are unbalanced, we got a decent amount of data for each sector.
final_data %>%
select(where(is.numeric)) %>%
select(-starts_with("Sector_")) %>%
pivot_longer(cols = everything(),
names_to = "variable",
values_to = "value") %>%
ggplot(aes(x = value)) +
geom_histogram(bins = 30) +
facet_wrap(vars(variable),
scales = "free",
nrow = 4,
ncol=5)

final_data %>%
select(where(is.factor)) %>%
pivot_longer(cols = everything(),
names_to = "variable",
values_to = "value") %>%
ggplot(aes(x = value)) +
geom_bar() +
facet_wrap(vars(variable),
scales = "free",
nrow = 4) +
theme(axis.text.x = element_text(angle = 45))

Then, we continued to explore the relationship between regressors and the outcome. By observation, we can’t see a strong correlation between fundemental factors, such as investments and debts, and stock return. It may indicate that linear model is not an ideal model in this case.
final_data %>%
ggplot(aes(x = INVESTMENTS, y = PROFIT,color = YEAR)) +
geom_point(alpha = 0.5)+
geom_smooth(se = FALSE) +
labs(title = "Relationship between investments and stock return")

final_data %>%
ggplot(aes(x = DEBTS, y = PROFIT,color = YEAR)) +
geom_point(alpha = 0.5)+
geom_smooth(se = FALSE) +
labs(title = "Relationship between debts and stock return")

With this anamiation, we could see that the returns of stocks follow a cycle, which may be influenced by macroeconomic condition. The return in 1999 and 2019 seems to be the highest for all industries. Also, we noticed that some industries vary a lot year to year, such as IT and healthcare industry.
sector_return_an<-final_data %>%
ggplot(aes(x = PROFIT, y = Sector)) +
geom_boxplot(aes(color = Sector),
alpha = .8,
size = 1) +
labs(title = "Spread of stock return in different sector",
subtitle = "YEAR: {closest_state}",
color = "") +
transition_states(YEAR)
animate(sector_return_an, duration = 25)
anim_save("sector_return.gif")
knitr::include_graphics("sector_return.gif")

We further explored the relationship between the industry and the potential macroeconomic influencer using graphs below. Though the relationship doesn’t seem to be linear, we do see how macroeconomic condition affect each industry differently and will account for that in our model.
final_data %>%
group_by(YEAR,Sector) %>%
mutate(profit_por = PROFIT*`MARKET CAP`/(sum(`MARKET CAP`,na.rm = TRUE))) %>%
summarise(Sector_profit = sum(profit_por),
GDP_PC1 = mean(GDP_PC1),
CPALTT01USM657N_PC1 = mean(CPALTT01USM657N_PC1),
Sector = Sector[1]) %>%
ggplot(aes(x = CPALTT01USM657N_PC1,y = Sector_profit,color = Sector))+
geom_smooth(se = FALSE) +
labs(title = "Relationship between inflation and average return of a industry",
x = "Percent Change in CPI",
y = "Average Return of a sector")

final_data %>%
group_by(YEAR,Sector) %>%
mutate(profit_por = PROFIT*`MARKET CAP`/(sum(`MARKET CAP`,na.rm = TRUE))) %>%
summarise(Sector_profit = sum(profit_por),
GDP_PC1 = mean(GDP_PC1),
CPALTT01USM657N_PC1 = mean(CPALTT01USM657N_PC1),
Sector = Sector[1]) %>%
ggplot(aes(x = GDP_PC1,y = Sector_profit,color = Sector))+
geom_smooth(se = FALSE) +
labs(title = "Relationship between GDP and average return of a industry",
x = "Percent Change in GDP",
y = "Average Return of a sector")

Lasso Model
Building Recipe
In the lasso model, to account for the fact the macroeconomic condition affects each industry differently, we also included the interaction term between GDP and sector dummies.
return_recipe <- recipe(PROFIT ~ ., #short-cut, . = all other vars
data = data_training) %>%
# filter to only have data after 2020
step_filter(YEAR<2021) %>%
step_rm(Name,Sector,YEAR) %>%
#step_rm(GDP,M1SL,Name,Sector,YEAR) %>%
# add PE
step_mutate(PE = `MARKET CAP`/EARNINGS) %>%
# Normalize all variables except for GDP
step_normalize(all_predictors(),
-all_nominal(),
-starts_with("Sector_")) %>%
# Create interaction terms
step_interact(terms = ~c(GDP_PC1):starts_with("Sector_"))
# show the data in recipe
return_recipe %>%
prep(data_training) %>%
# using bake(new_data = NULL) gives same result as juice()
# bake(new_data = NULL)
juice()
Select tuning parameter
return_linear_mod <-
# Define a lasso model
# I believe default is mixture = 1 so probably don't need
linear_reg(mixture = 1) %>%
# Set the engine to "glmnet"
set_engine("glmnet") %>%
# The parameters we will tune.
set_args(penalty = tune()) %>%
# Use "regression"
set_mode("regression")
set.seed(456)
return_lm_wf <-
# Set up the workflow
workflow() %>%
# Add the recipe
add_recipe(return_recipe) %>%
# Add the modeling
add_model(return_linear_mod)
penalty_grid <- grid_regular(penalty(),
levels = 10)
return_cv <- vfold_cv(data_training, v = 5)
return_lm_tune <-
return_lm_wf %>%
tune_grid(
resamples = return_cv,
grid = penalty_grid
)
best_param<-return_lm_tune %>%
select_best(metric = "rmse")
return_lasso_final_wf <- return_lm_wf %>%
finalize_workflow(best_param)
return_lasso_final_mod <- return_lasso_final_wf %>%
fit(data = data_training)
# visulization for best param
set.seed(456)
return_lm_tune %>%
collect_metrics() %>%
filter(.metric == "rmse") %>%
ggplot(aes(x = penalty, y = mean)) +
geom_point() +
geom_line() +
scale_x_log10(
breaks = scales::trans_breaks("log10", function(x) 10^x),
labels = scales::trans_format("log10",scales::math_format(10^.x))) +
labs(x = "penalty", y = "rmse")

Lasso results
The table below shows the estimate of each predictor in the Lasso Model. We can see that indicators in macroeconomics are relatively important to predict the stock return. The change in inflation and GDP all remain significant after shrinking. The interaction terms between GDP and industries also showed importance, accounting for the fact that the macroeconomic condition affects each industry differently: GDP seems to affect the stock return in Communication Services, Energy, and Consumer Discretionary sectors less.
return_lasso_final_mod %>%
pull_workflow_fit() %>%
tidy() %>%
arrange(desc(estimate)) %>%
kbl() %>%
kable_styling(bootstrap_options = c("striped", "bordered", "hover", "condensed")) %>%
scroll_box(width = "100%", height = "500px")
|
term
|
estimate
|
penalty
|
|
M1SL
|
52.2489802
|
0.0059948
|
|
CPALTT01USM657N_PC1
|
22.8362276
|
0.0059948
|
|
GDP_PC1
|
18.5139718
|
0.0059948
|
|
(Intercept)
|
17.4206737
|
0.0059948
|
|
GDP_PC1_x_Sector_Utilities
|
14.5169194
|
0.0059948
|
GDP_PC1_x_Sector_Consumer Staples
|
10.9623359
|
0.0059948
|
GDP_PC1_x_Sector_Health Care
|
10.0994574
|
0.0059948
|
|
Sector_Information Technology
|
9.1062940
|
0.0059948
|
|
T10Y2Y
|
8.3785234
|
0.0059948
|
GDP_PC1_x_Sector_Real Estate
|
7.9131832
|
0.0059948
|
|
Sector_Communication Services
|
7.4629557
|
0.0059948
|
|
Sector_Health Care
|
5.9942110
|
0.0059948
|
|
Sector_Consumer Discretionary
|
4.8412271
|
0.0059948
|
GDP_PC1_x_Sector_Information Technology
|
4.0219608
|
0.0059948
|
|
Sector_Energy
|
2.7587880
|
0.0059948
|
|
COGS
|
2.2012752
|
0.0059948
|
|
GDP_PC1_x_Sector_Industrials
|
2.0052201
|
0.0059948
|
|
VOLUME
|
0.9051955
|
0.0059948
|
|
INVESTMENTS
|
0.4845378
|
0.0059948
|
|
CASH
|
0.4031170
|
0.0059948
|
|
EARNINGS
|
0.3671604
|
0.0059948
|
|
RECEIVABLE
|
0.1688682
|
0.0059948
|
|
INVENTORY
|
0.1677270
|
0.0059948
|
|
GDP_PC1_x_Sector_Financials
|
0.0413482
|
0.0059948
|
|
Sector_Materials
|
0.0000000
|
0.0059948
|
|
PE
|
-0.2304146
|
0.0059948
|
|
DEBTS
|
-0.2444685
|
0.0059948
|
|
Sector_Industrials
|
-0.4250661
|
0.0059948
|
|
GDP_PC1_x_Sector_Materials
|
-1.0239217
|
0.0059948
|
GDP_PC1_x_Sector_Communication Services
|
-1.4192763
|
0.0059948
|
|
Sector_Financials
|
-1.5534082
|
0.0059948
|
|
SALES
|
-3.2886701
|
0.0059948
|
|
Sector_Real Estate
|
-3.3504621
|
0.0059948
|
|
MARKET CAP
|
-3.8678288
|
0.0059948
|
|
Sector_Consumer Staples
|
-5.4154183
|
0.0059948
|
|
GDP_PC1_x_Sector_Energy
|
-5.8821361
|
0.0059948
|
GDP_PC1_x_Sector_Consumer Discretionary
|
-8.6297519
|
0.0059948
|
|
Sector_Utilities
|
-8.7405789
|
0.0059948
|
|
M1SL_PC1
|
-11.4973966
|
0.0059948
|
|
GDP
|
-16.0930330
|
0.0059948
|
Model Evaluation
Prediciton precision
The Lasso model does not perform well. It only explains 20% of the varaince in stock return and it also has a relatively high rmse of around 46.
return_lm_tune %>%
select(id, .metrics) %>%
unnest(.metrics) %>%
filter(.metric == "rsq") %>%
filter(.config == "Preprocessor1_Model08") %>%
summarise(mean_rsq = mean(.estimate))
return_lm_tune %>%
select(id, .metrics) %>%
unnest(.metrics) %>%
filter(.metric == "rmse") %>%
filter(.config == "Preprocessor1_Model08") %>%
summarise(mean_rmse = mean(.estimate))
set.seed(456)
prediction <- predict(
return_lasso_final_mod,
new_data = data_training)
training_pred<-data_training %>%
mutate(.pred = prediction$.pred)
training_pred %>%
ggplot(aes(x = PROFIT,
y = .pred,
color = YEAR)) +
geom_point(alpha = .5,
size = .5) +
geom_smooth(se = FALSE) +
geom_abline(slope = 1,
intercept = 0,
color = "darkred") +
geom_text(aes(label = Name),label.size = 0.15,data = training_pred %>% filter(PROFIT>400) )+
labs(x = "Actual Return",
y = "Predicted Return") +
scale_color_viridis_b()
Overfitting
Below shows the r-square and rmse on testing data. They are very similar to those on trainning data so we would safely conclude that our model did not overfit.
return_lasso_test <- return_lasso_final_wf %>%
last_fit(data_split)
# Metrics for model applied to test data
return_lasso_test %>%
collect_metrics()
The following graphs show the performance of the model on the testing data:
test_prediction <- predict(
return_lasso_final_mod,
new_data = data_testing)
testing_lasso_pred<-data_testing %>%
mutate(.pred = test_prediction$.pred)
testing_lasso_pred %>%
ggplot(aes(x = PROFIT,
y = .pred,
color = YEAR)) +
geom_point(alpha = .5,
size = .5) +
geom_smooth(se = FALSE) +
geom_abline(slope = 1,
intercept = 0,
color = "darkred") +
geom_text(aes(label = Name),label.size = 0.15,data = testing_lasso_pred %>% filter(PROFIT>300) )+
labs(x = "Actual Return",
y = "Predicted Return") +
scale_color_viridis_b()

Interpretable Machine Learning
Again, the boxplot and the histogram of residuals, the model does not predict the return well and this may because some extreme values.
lasso_explain <-
explain_tidymodels(
model = return_lasso_final_mod,
data = data_training %>% select(-PROFIT),
y = data_training %>% pull(PROFIT),
label = "lasso"
)
## Preparation of a new explainer is initiated
## -> model label : lasso
## -> data : 7172 rows 30 cols
## -> data : tibble converted into a data.frame
## -> target variable : 7172 values
## -> predict function : yhat.workflow will be used ( [33m default [39m )
## -> predicted values : No value for predict function target column. ( [33m default [39m )
## -> model_info : package tidymodels , ver. 0.1.3 , task regression ( [33m default [39m )
## -> predicted values : numerical, min = -49.61277 , mean = 19.22262 , max = 121.2461
## -> residual function : difference between y and yhat ( [33m default [39m )
## -> residuals : numerical, min = -146.2719 , mean = -4.401566e-14 , max = 1002.161
## [32m A new explainer has been created! [39m
lasso_mod_perf <- model_performance(lasso_explain)
hist_plot <-
plot(lasso_mod_perf,
geom = "histogram")
box_plot <-
plot(lasso_mod_perf,
geom = "boxplot")
hist_plot

box_plot

Below shows the feature importance plots generated by two different methods. We could see that macroeconomic indicators remained importance in both of the plots but market cap seems to be an crucial factor when we use permutation method.
set.seed(10) #since we are sampling & permuting, we set a seed so we can replicate the results
lasso_var_imp <-
model_parts(
lasso_explain
)
plot(lasso_var_imp, show_boxplots = TRUE)

return_lasso_final_mod %>%
pull_workflow_fit() %>%
vip()

Random Forest Model
Building Random Forest model
To build the random forest model, we set up recipe using our training data (data_training), define model with mtry = 6, min_n = 10 and numbers of tree = 200, create ranger workflow, and then we can fit the model.
# set up recipe
ranger_recipe <-
recipe(PROFIT ~ ., #short-cut, . = all other vars
data = data_training) %>%
step_filter(YEAR<2021) %>%
# remove the unwanted variables
step_rm(YEAR,Name,GDP,M1SL,Sector) %>%
# add PE
step_mutate(PE = `MARKET CAP`/EARNINGS)
#define model
ranger_spec <-
rand_forest(mtry = 6,
min_n = 10,
trees = 200) %>%
set_mode("regression") %>%
set_engine("ranger")
#create workflow
ranger_workflow <-
workflow() %>%
add_recipe(ranger_recipe) %>%
add_model(ranger_spec)
#fit the model
set.seed(712) # for reproducibility - random sampling in random forest choosing number of variables
ranger_fit <- ranger_workflow %>%
fit(data_training)
Random Forest results
The table below shows the OOB error (MSE), OOB RMSE (Root mean square error), and R squared of the stock return predictions using the random forest model above.
OOB error:
ranger_fit$fit$fit$fit$prediction.error
## [1] 1545.135
OOB RMSE:
sqrt(ranger_fit$fit$fit$fit$prediction.error)
## [1] 39.3082
R Squared:
ranger_fit$fit$fit$fit$r.squared
## [1] 0.4086307
Model Evaluation
Prediciton precision (Traning data)
Even though the mean rmse from the random forest (39.30787) seems to be fairly high but it is still lower than the LASSO model of 46. Thus, random forest model performs better.
set.seed(1211) # for reproducibility
data_cv <- vfold_cv(data_training, v = 5)
metric <- metric_set(rmse)
ctrl_res <- control_stack_resamples()
ranger_cv <- ranger_workflow %>%
fit_resamples(data_cv,
metrics = metric,
control = ctrl_res)
# Evaluation metrics averaged over all folds:
collect_metrics(ranger_cv)
We also plot a graph showing the actual return vs predicted return.
ranger_prediction <- predict(
ranger_fit,
new_data = data_training)
ranger_training_pred<-data_training %>%
mutate(.pred = ranger_prediction$.pred)
ranger_training_pred %>%
ggplot(aes(x = PROFIT,
y = .pred)) +
geom_point(alpha = .5,
size = .5) +
geom_smooth(se = FALSE) +
geom_abline(slope = 1,
intercept = 0,
color = "darkred") +
labs(x = "Actual Return",
y = "Predicted Return")

Prediciton precision (Testing data)
The table below shows the rmse on testing data. It is a bit higher but fairly similar to the error on training data so we could say that our model did not overfit.
set.seed(1211) # for reproducibility
data_cv <- vfold_cv(data_testing, v = 5)
metric <- metric_set(rmse)
ctrl_res <- control_stack_resamples()
ranger_cv <- ranger_workflow %>%
fit_resamples(data_cv,
metrics = metric,
control = ctrl_res)
# Evaluation metrics averaged over all folds:
collect_metrics(ranger_cv)
Below is the graph showing the actual return vs. predicted return on testing data.
ranger_test_prediction <- predict(
ranger_fit,
new_data = data_testing)
ranger_test_pred<-data_testing %>%
mutate(.pred = ranger_test_prediction$.pred)
ranger_test_pred %>%
ggplot(aes(x = PROFIT,
y = .pred)) +
geom_point(alpha = .5,
size = .5) +
geom_smooth(se = FALSE) +
geom_abline(slope = 1,
intercept = 0,
color = "darkred") +
labs(x = "Actual Return",
y = "Predicted Return")

Interpretable Machine Learning
Based on the box-plot and the histogram below, the residuals mostly lie between -50 to 50. But there are some a few outliners that can go up to 400.
rf_explain <-
explain_tidymodels(
model = ranger_fit,
data = data_training %>% select(-PROFIT),
y = data_training %>% pull(PROFIT),
label = "rf"
)
## Preparation of a new explainer is initiated
## -> model label : rf
## -> data : 7172 rows 30 cols
## -> data : tibble converted into a data.frame
## -> target variable : 7172 values
## -> predict function : yhat.workflow will be used ( [33m default [39m )
## -> predicted values : No value for predict function target column. ( [33m default [39m )
## -> model_info : package tidymodels , ver. 0.1.3 , task regression ( [33m default [39m )
## -> predicted values : numerical, min = -76.57776 , mean = 19.57231 , max = 438.0276
## -> residual function : difference between y and yhat ( [33m default [39m )
## -> residuals : numerical, min = -101.4695 , mean = -0.3496883 , max = 599.5611
## [32m A new explainer has been created! [39m
rf_mod_perf <- model_performance(rf_explain)
hist_plot <-
plot(rf_mod_perf,
geom = "histogram")
box_plot <-
plot(rf_mod_perf,
geom = "boxplot")
hist_plot

box_plot

According to the feature importance bar chart below, we can see that the top three important features are market cap, earnings, M1SL_PC1, and GDP_PC1.
set.seed(10) #since we are sampling & permuting, we set a seed so we can replicate the results
rf_var_imp <-
model_parts(
rf_explain
)
plot(rf_var_imp, show_boxplots = TRUE)

After creating two models, we then move on to the final method: stacking. For the stacking model, we also add one more model: KNN to improve the accuracy rate of the model.
Stacking Model
Random Forest Candidate
ranger_recipe <-
recipe(formula = PROFIT ~ .,
data = data_training) %>%
# Make these evaluative variables, not included in modeling
update_role(all_of(c("YEAR", "Name", "Sector")),
new_role = "evaluative")
ranger_spec <-
rand_forest(mtry = 6,
min_n = 10,
trees = 200) %>%
set_mode("regression") %>%
set_engine("ranger")
ranger_workflow <-
workflow() %>%
add_recipe(ranger_recipe) %>%
add_model(ranger_spec)
ranger_fit <- ranger_workflow %>%
fit(data_training)
set.seed(1211) # for reproducibility
final_data_cv <- vfold_cv(data_training, v = 5)
metric <- metric_set(rmse)
ctrl_res <- control_stack_resamples()
ranger_cv <- ranger_workflow %>%
fit_resamples(final_data_cv,
metrics = metric,
control = ctrl_res)
LASSO Candiate
# lasso recipe and transformation steps
lasso_final_data_recipe <- recipe(PROFIT ~ .,
data = data_training) %>%
#step_rm(Name, Sector, YEAR, COMPANY) %>%
update_role(all_of(c("Name",
"Sector",
"YEAR")),
new_role = "evaluative") %>%
step_dummy(all_nominal(),
-all_outcomes(),
-has_role(match = "evaluative")) %>%
step_normalize(all_predictors(),
-all_nominal())
#define lasso model
lasso_mod <-
linear_reg(mixture = 1) %>%
set_engine("glmnet") %>%
set_args(penalty = tune()) %>%
set_mode("regression")
# create workflow
lasso_wf <-
workflow() %>%
add_recipe(lasso_final_data_recipe) %>%
add_model(lasso_mod)
# penalty grid - changed to 10 levels
penalty_grid <- grid_regular(penalty(),
levels = 10)
# add ctrl_grid - assures predictions and workflows are saved
ctrl_grid <- control_stack_grid()
# tune the model using the same cv samples as random forest
lasso_tune <-
lasso_wf %>%
tune_grid(
resamples = final_data_cv,
grid = penalty_grid,
metrics = metric,
control = ctrl_grid
)
KNN Candidate
# create a model definition
knn_mod <-
nearest_neighbor(
neighbors = tune("k")
) %>%
set_engine("kknn") %>%
set_mode("regression")
# create the workflow
knn_wf <-
workflow() %>%
add_model(knn_mod) %>%
add_recipe(lasso_final_data_recipe)
# tune it using 4 tuning parameters
knn_tune <-
knn_wf %>%
tune_grid(
final_data_cv,
metrics = metric,
grid = 4,
control = ctrl_grid
)
Stacking all candidates
By showing and visualize the members in our stacking model, we could see that random forest model contribute the most.
final_data_stack <-
stacks() %>%
add_candidates(ranger_cv) %>%
add_candidates(lasso_tune) %>%
add_candidates(knn_tune)
final_data_blend <-
final_data_stack %>%
blend_predictions()
final_data_blend
## # A tibble: 2 x 3
## member type weight
## <chr> <chr> <dbl>
## 1 ranger_cv_1_1 rand_forest 0.835
## 2 knn_tune_1_4 nearest_neighbor 0.254
autoplot(final_data_blend)

final_data_blend$metrics %>%
filter(.metric == "rmse") %>%
summarise(mean_rmse = mean(mean))
final_data_final_stack <- final_data_blend %>%
fit_members()
Table below shows the top10 prediction when we applied the model to testing data. Here, we can see that when the model predicts company with high rate of return, the actual return is high as well.
final_data_final_stack %>%
predict(new_data = data_testing) %>%
bind_cols(data_testing) %>%
select(Name, .pred, PROFIT) %>%
arrange(desc(.pred)) %>%
head(10)
Comparison of the three models
With the three models, we then move on to see which model performs the best:
Lasso model
data_frame(return_lm_tune %>%
select(id, .metrics) %>%
unnest(.metrics) %>%
filter(.metric == "rsq") %>%
filter(.config == "Preprocessor1_Model08") %>%
summarise(mean_rsq = mean(.estimate)),
return_lm_tune %>%
select(id, .metrics) %>%
unnest(.metrics) %>%
filter(.metric == "rmse") %>%
filter(.config == "Preprocessor1_Model08") %>%
summarise(mean_rmse = mean(.estimate)))
Random Forest Model
#OOB RMSE
data.frame(mean_rmse = sqrt(ranger_fit$fit$fit$fit$prediction.error),
# R squared
mean_rsq = ranger_fit$fit$fit$fit$r.squared)
Stacking Model
# Stacking model:
data.frame(final_data_blend$metrics %>%
filter(.metric == "rmse") %>%
summarise(mean_rmse = mean(mean)),
final_data_blend$metrics %>%
filter(.metric == "rsq") %>%
summarise(mean_rsq = mean(mean)))
Here, we can see that compared to the three models, even though Random Forest performs better than the stacking model, stacking model use features in the Random Forest along with additional features from KNN and lasso. With that reason, we will choose stacking model as our model choice.
Stock Return Prediction for 2021
After picking our model, we then move on to use the model to predict the potential profit for 2021:
set.seed(456)
ytd = c(80.30, 39.77, -8.26, 11.29, 22.08, 132.51, 137.04, -4.22, 39.77, 20.76, 26.41, 26.34, 41.50, 178.43, 18.70, 87.29, -6.16, 52.58, 8.70, -13.70)
set.seed(456)
pred_2021<-final_data_final_stack %>%
predict(new_data = final_data_2021) %>%
bind_cols(final_data_2021) %>%
select(Name, .pred) %>%
arrange(desc(.pred)) %>%
head(20) %>%
mutate(actual_ytd = ytd)
pred_2021 %>%
kbl() %>%
kable_styling(bootstrap_options = c("striped", "bordered", "hover", "condensed")) %>%
column_spec(c(1,3), color = ifelse(pred_2021$actual_ytd < 0, "red", "darkgreen")) %>%
scroll_box(width = "100%", height = "500px")
|
Name
|
.pred
|
actual_ytd
|
|
APA Corporation
|
67.67959
|
80.30
|
|
Marathon Oil
|
63.13998
|
39.77
|
|
News Corp (Class B)
|
63.06597
|
-8.26
|
|
Under Armour (Class C)
|
62.82090
|
11.29
|
|
DuPont
|
62.70736
|
22.08
|
|
Ford
|
62.55623
|
132.51
|
|
Norwegian Cruise Line Holdings
|
62.50470
|
137.04
|
|
Under Armour (Class A)
|
61.68342
|
-4.22
|
|
Carnival Corporation
|
61.52755
|
39.77
|
|
DXC Technology
|
59.89012
|
20.76
|
|
Halliburton
|
59.45987
|
26.41
|
|
Raytheon Technologies
|
58.35294
|
26.34
|
|
Devon Energy
|
57.31403
|
41.50
|
|
Schlumberger
|
57.14758
|
178.43
|
|
ConocoPhillips
|
56.75287
|
18.70
|
|
Baker Hughes
|
56.39542
|
87.29
|
|
Hess Corporation
|
56.39404
|
-6.16
|
|
Southwest Airlines
|
56.14967
|
52.58
|
|
Bristol Myers Squibb
|
56.04499
|
8.70
|
|
Boeing
|
55.79926
|
-13.70
|
Conclusion
The weakness of our model is that the average error of prediction is still very high so if one wants to predict the exact return, our model won’t be ideal. But the strength of our model is that for stocks with extremely high returns, even though the prediction might not be that precise, highly likely, our model will predict positive returns. That means, in real life, if we choose the top stocks to invest in based on our prediction, it is less likely we are going to lose money. Another strength of using a model to help with investment is that it excludes our subjective feelings.
To make the model better, we could do more research and add more regressors. For example, some financial indicators that are important for value investing are not reflected in our model due to the lack of data. Examples of those variables include Price to Sales, Price to Cash Flow, and Price to Book. Also, since we are concerned about the long-term return and all fundamental factors usually take longer to affect the firms, it’s probably helpful to do return in 2 years or 3 years or include lag of some variables in our model.
---
title: "Final Project"
author: "Duc Ngo, Rita Liu, Sivhuo Prak"
output:
  html_document:
    df_print: paged
    toc: TRUE
    toc_float: TRUE
    code_download: true
    code_folding: hide
    theme: journal
---

```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = TRUE, message=FALSE, warning=FALSE)
```

```{r}
library(tidyverse)        # for data cleaning and plotting
library(tidymodels)       # for modeling ... tidily
library(lubridate)        # for date manipulation
library(openintro)        # for the abbr2state() function
library(gplots)           # for col2hex() function
library(RColorBrewer)     # for color palettes
library(ggthemes)         # for more themes (including theme_map())
library(plotly)           # for the ggplotly() - basic interactivity
library(gganimate)        # for adding animation layers to ggplots
library(transformr)       # for "tweening" (gganimate)
library(gifski)           # need the library for creating gifs but don't need to load each time
library(shiny)            # for creating interactive apps
library(janitor)  
library(stacks)            # for stacking models
library(glmnet)            # for regularized regression, including LASSO
library(ranger)            # for random forest model
library(kknn)              # for knn model
library(naniar)            # for examining missing values (NAs)
library(moderndive)        # for King County housing data
library(vip)               # for variable importance plots
library(rmarkdown)         # for paged tables
library(fastDummies)
library(usemodels)         # for suggesting step_XXX() functions
library(readr)
library(kableExtra)
library(DALEX)             # model Agnostic Language for exploration and explanation (for model interpretation)  
library(DALEXtra)
theme_set(theme_minimal())
```


## Introduction 

The financial market is a strange place that is very hard to navigate around. We have seen Warren Buffett, Ray Dalio, Charlie Munger - the very very best who have had billions of dollars. Others, 95% of the population, lose the money instead. 

So, how are the best of the best pick its stocks? It is from the fundamental, the technical side or the sentimental side? Within this paper, we hope to bring another perspective, using machine learning models to predict the profitability of the stock price. 

### Paper Outline 

First, we will list and explain the definition of each variable in our dataset. There are 25 variables in total as listed below. The variable we are going to predict is `PROFIT`. Then, we will start to process our data by merging with the macro data and removing some unwanted variables. After processing the data, we will have a data visualization section that shows the distribution and relationship between some variables. And to find the best model, we choose to explore three machine learning models which are LASSO, Random forest and Stacking models. We use a stacking method to create the stacking model by combining three models (Lasso, Random Forest, and KNN). Lastly, we will make a stock return prediction for 2021 using the trained model with the lowest R-squared and RMSE. 

### List of Variables 

For the dataset, we includes financial information on companies in the S&P 500 stock index from 1999-2021. This information was scraped from Yahoo Finance in November of 2021, and collected in a csv format for data analysis. The information includes metrics like sales, earnings, cogs, stock price, and market sector as well as macroeconomic data such as GDP or Money Supply. The goal is to analyze and model this data to better improve projections for a company’s future profitability. The variables in the data set are described below:

| Variable            | Meaning                                                                                                     |
|---------------------|-------------------------------------------------------------------------------------------------------------|
| YEAR                | The financial year of the company                                                                           |
| COMPANY             | The company’s stock abbreviation symbol                                                                     |
| MARKET.CAP          | The total market capitalization of the company (Volume * Price)                                             |
| EARNINGS            | The earnings in dollars for the previous year for the given company                                         |
| SALES               | How much the company sold in dollars last year                                                              |
| CASH                | How much cash the company has in dollars at the end of the previous year                                    |
| Name                | The full name of the company                                                                                |
| Sector              | The name of the sector that the company is a part of                                                        |
| Earnings_next_year  | The amount of money in dollars that the company earns in the following year                                 |
| PRICE               | The price of the stock when it is bought                                                                    | 
| Sell                | The price of the stock when it is sold                                                                      |
| VOLUME              | The total number of shares that the company holds                                                           |
| COGS                | The total amount the company paid as a cost directly related to the sale of products                        | 
| INVESTMENT          | The total asset or item acquired with the goal of generating income or appreciation                         |
| RECIEVABLE          | The debts owed to a company by its customers for goods that have been delivered or used but not yet paid for| 
| INVENTORY           | How much raw materials used in production as well as the goods produced that are available for sale         | 
| DEBTS               | How much money the company borrow from other parties                                                        | 
| CPALTT01USM657N_PC1 | The percentage change in CPI (measure of inflation)                                                         | 
| GDP                 | The monetary value of all finished goods and services made within a country                                 |
| GDP_PC1             | The percentage change in GDP                                                                                |
| T10Y2Y              | Ten year treasury bonds minus two year treasury bonds                                                       |
| M1SL                | The total currency and other liquid instruments in a country's economy                                      |
| M1SL_PC1            | The percentage change in money supply                                                                       |
| Earnings_next_year  | How much profit that a company produces next year                                                           | 
| PROFIT              | How much the money made or lost on an investment                                                            |

### Loading data

```{r}
finalDATASET <- read_csv("FINALDATASET.csv")
macro_data <- read_csv("clean_macro - Sheet1.csv")
```

## Data Preprocessing 
For the data preprocessing, we combined our data sources and filled missing values for some fundamental factors by using the median of that value in the sector for a certain year. We also created sector dummies in this step. We then dropped some variables that are not in our interests and the splited the testing and trainning data. 
```{r}
final_data <- finalDATASET %>% 
  # make Sector dummy variables
  dummy_cols(select_columns = "Sector") %>% 
  # merge with updated macro data
  select(-CPALTT01USM657N_PC1,
         -GDP,
         -GDP_PC1,
         -M1SL_PC1 ,
         -M1SL,
         -PRICE,
         -Sell,
         -COMPANY) %>% 
  merge(macro_data) %>% 
  # convert Macro factors from characters to numeric
  mutate(across(c("CPALTT01USM657N_PC1","GDP","GDP_PC1","T10Y2Y","M1SL_PC1","M1SL"),
                as.numeric)) %>% 
  group_by(YEAR,Sector) %>% 
  # replacing the missing value with median od the industry in that year
  mutate(across(c(DEBTS,INVESTMENTS,CASH,VOLUME,EARNINGS,COGS,SALES,RECEIVABLE,INVENTORY),~replace(.,.==0,median(.)))) %>%
  # delete less important factors -> Exchange can possibily be deleted
  ungroup() %>%  
  mutate(across(c(!where(is.numeric),-"Name"),as.factor)) %>% 
  select(-Earnings_next_year,-observation_date) %>% 
  drop_na() 

# filter out the data for 2021
final_data_2021 <- final_data %>% 
  filter(YEAR == 2021)

final_data <- final_data %>% 
  filter(YEAR < 2021)

# split the data
set.seed(327) #for reproducibility

data_split <- initial_split(final_data, 
                             prop = .75)
data_training <- training(data_split)
data_testing <- testing(data_split)

# quick look of the data 
final_data %>% 
  head(5)
```

## Data Visualization
We first explored distributions of our predictors and outcome. As the following graph suggested, all the numeric variables are severely right-skewed with some outliers. In our data, though the numbers of firms in each sector are unbalanced, we got a decent amount of data for each sector.  
```{r}

  final_data %>% 
  select(where(is.numeric)) %>% 
  select(-starts_with("Sector_")) %>% 
  pivot_longer(cols = everything(),
               names_to = "variable", 
               values_to = "value") %>% 
  ggplot(aes(x = value)) +
  geom_histogram(bins = 30) +
  facet_wrap(vars(variable), 
             scales = "free",
             nrow = 4,
             ncol=5)

```
```{r}
final_data %>% 
  select(where(is.factor)) %>% 
  pivot_longer(cols = everything(),
               names_to = "variable", 
               values_to = "value") %>% 
  ggplot(aes(x = value)) +
  geom_bar() +
  facet_wrap(vars(variable), 
             scales = "free", 
             nrow = 4) + 
  theme(axis.text.x = element_text(angle = 45))

```

  
Then, we continued to explore the relationship between regressors and the outcome. By observation, we can't see a strong correlation between fundemental factors, such as investments and debts, and stock return. It may indicate that linear model is not an ideal model in this case. 

```{r}
final_data %>% 
  ggplot(aes(x = INVESTMENTS, y = PROFIT,color = YEAR)) + 
  geom_point(alpha = 0.5)+
  geom_smooth(se = FALSE) + 
  labs(title = "Relationship between investments and stock return")
```

```{r}
final_data %>% 
  ggplot(aes(x = DEBTS, y = PROFIT,color = YEAR)) + 
  geom_point(alpha = 0.5)+
  geom_smooth(se = FALSE) + 
  labs(title = "Relationship between debts and stock return")
```

With this anamiation, we could see that the returns of stocks follow a cycle, which may be influenced by macroeconomic condition. The return in 1999 and 2019 seems to be the highest for all industries. Also, we noticed that some industries vary a lot year to year, such as IT and healthcare industry. 

```{r eval = FALSE}
sector_return_an<-final_data %>%
  ggplot(aes(x = PROFIT, y = Sector)) +
  geom_boxplot(aes(color = Sector),
             alpha = .8,
             size = 1) + 
  labs(title = "Spread of stock return in different sector",
       subtitle = "YEAR: {closest_state}",
       color = "") + 
  transition_states(YEAR)
  
animate(sector_return_an, duration = 25)
anim_save("sector_return.gif")
```

```{r}
knitr::include_graphics("sector_return.gif")
```

We further explored the relationship between the industry and the potential macroeconomic influencer using graphs below. Though the relationship doesn't seem to be linear, we do see how macroeconomic condition affect each industry differently and will account for that in our model.

```{r}
final_data %>% 
  group_by(YEAR,Sector) %>% 
  mutate(profit_por = PROFIT*`MARKET CAP`/(sum(`MARKET CAP`,na.rm = TRUE))) %>% 
  summarise(Sector_profit = sum(profit_por),
            GDP_PC1 = mean(GDP_PC1),
            CPALTT01USM657N_PC1 = mean(CPALTT01USM657N_PC1),
            Sector = Sector[1]) %>% 
  ggplot(aes(x = CPALTT01USM657N_PC1,y = Sector_profit,color = Sector))+
  geom_smooth(se = FALSE) + 
  labs(title = "Relationship between inflation and average return of a industry",
       x = "Percent Change in CPI",
       y = "Average Return of a sector")
  
```

```{r}
final_data %>% 
  group_by(YEAR,Sector) %>% 
  mutate(profit_por = PROFIT*`MARKET CAP`/(sum(`MARKET CAP`,na.rm = TRUE))) %>% 
  summarise(Sector_profit = sum(profit_por),
            GDP_PC1 = mean(GDP_PC1),
            CPALTT01USM657N_PC1 = mean(CPALTT01USM657N_PC1),
            Sector = Sector[1]) %>% 
  ggplot(aes(x = GDP_PC1,y = Sector_profit,color = Sector))+
  geom_smooth(se = FALSE) + 
  labs(title = "Relationship between GDP  and average return of a industry",
       x = "Percent Change in GDP",
       y = "Average Return of a sector")
```


## Lasso Model 
### Building Recipe

In the lasso model, to account for the fact the macroeconomic condition affects each industry differently, we also included the interaction term between GDP and sector dummies. 

```{r}
return_recipe <- recipe(PROFIT ~ ., #short-cut, . = all other vars
                       data = data_training) %>% 
  # filter to only have data after 2020
  step_filter(YEAR<2021) %>% 
  step_rm(Name,Sector,YEAR) %>% 
  #step_rm(GDP,M1SL,Name,Sector,YEAR) %>% 
  # add PE 
  step_mutate(PE = `MARKET CAP`/EARNINGS) %>%
  # Normalize all variables except for GDP
  step_normalize(all_predictors(), 
                 -all_nominal(),
                 -starts_with("Sector_")) %>% 
  # Create interaction terms
  step_interact(terms = ~c(GDP_PC1):starts_with("Sector_")) 
# show the data in recipe
return_recipe %>% 
  prep(data_training) %>%
  # using bake(new_data = NULL) gives same result as juice()
  # bake(new_data = NULL)
  juice() 
```

### Select tuning parameter

```{r}
return_linear_mod <- 
  # Define a lasso model 
  # I believe default is mixture = 1 so probably don't need 
  linear_reg(mixture = 1) %>% 
  # Set the engine to "glmnet" 
  set_engine("glmnet") %>% 
  # The parameters we will tune.
  set_args(penalty = tune()) %>% 
  # Use "regression"
  set_mode("regression")

set.seed(456)
return_lm_wf <- 
  # Set up the workflow
  workflow() %>% 
  # Add the recipe
  add_recipe(return_recipe) %>% 
  # Add the modeling
  add_model(return_linear_mod)

penalty_grid <- grid_regular(penalty(),
                             levels = 10)

return_cv <- vfold_cv(data_training, v = 5)

return_lm_tune <- 
  return_lm_wf %>% 
  tune_grid(
    resamples = return_cv,
    grid = penalty_grid
    )

best_param<-return_lm_tune %>% 
  select_best(metric = "rmse")

return_lasso_final_wf <- return_lm_wf %>% 
  finalize_workflow(best_param)

return_lasso_final_mod <- return_lasso_final_wf %>% 
  fit(data = data_training)

# visulization for best param
set.seed(456)
return_lm_tune %>% 
  collect_metrics() %>% 
  filter(.metric == "rmse") %>% 
  ggplot(aes(x = penalty, y = mean)) +
  geom_point() +
  geom_line() +
  scale_x_log10(
   breaks = scales::trans_breaks("log10", function(x) 10^x),
   labels = scales::trans_format("log10",scales::math_format(10^.x))) +
  labs(x = "penalty", y = "rmse")
```



### Lasso results

The table below shows the estimate of each predictor in the Lasso Model.  We can see that indicators in macroeconomics are relatively important to predict the stock return. The change in inflation and GDP all remain significant after shrinking. The interaction terms between GDP and industries also showed importance, accounting for the fact that the macroeconomic condition affects each industry differently: GDP seems to affect the stock return in Communication Services, Energy, and Consumer Discretionary sectors less. 

```{r}
return_lasso_final_mod %>% 
  pull_workflow_fit() %>% 
  tidy() %>% 
  arrange(desc(estimate)) %>% 
  kbl() %>% 
  kable_styling(bootstrap_options = c("striped", "bordered", "hover", "condensed")) %>% 
  scroll_box(width = "100%", height = "500px")
```

### Model Evaluation

#### Prediciton precision 

The Lasso model does not perform well. It only explains 20% of the varaince in stock return and it also has a relatively high rmse of around 46. 

```{r}
return_lm_tune %>%
  select(id, .metrics) %>%
  unnest(.metrics) %>%
  filter(.metric == "rsq") %>% 
  filter(.config == "Preprocessor1_Model08") %>% 
  summarise(mean_rsq = mean(.estimate))

return_lm_tune %>%
  select(id, .metrics) %>%
  unnest(.metrics) %>%
  filter(.metric == "rmse") %>% 
  filter(.config == "Preprocessor1_Model08") %>% 
  summarise(mean_rmse = mean(.estimate))
```

```{r eval = FALSE}
set.seed(456)
prediction <- predict(
  return_lasso_final_mod,
  new_data = data_training)

training_pred<-data_training %>% 
  mutate(.pred = prediction$.pred)
  
  
training_pred %>% 
  ggplot(aes(x = PROFIT, 
             y = .pred,
             color = YEAR)) +
  geom_point(alpha = .5, 
             size = .5) +
  geom_smooth(se = FALSE) +
  geom_abline(slope = 1, 
              intercept = 0, 
              color = "darkred") +
  geom_text(aes(label = Name),label.size = 0.15,data = training_pred %>% filter(PROFIT>400) )+
  labs(x = "Actual Return", 
       y = "Predicted Return") +
  scale_color_viridis_b()
```

#### Overfitting

Below shows the r-square and rmse on testing data. They are very similar to those on trainning data so we would safely conclude that our model did not overfit. 
```{r}
return_lasso_test <- return_lasso_final_wf %>% 
  last_fit(data_split)

# Metrics for model applied to test data
return_lasso_test %>% 
  collect_metrics()
```

The following graphs show the performance of the model on the testing data: 

```{r}
test_prediction <- predict(
  return_lasso_final_mod,
  new_data = data_testing)

testing_lasso_pred<-data_testing %>% 
  mutate(.pred = test_prediction$.pred)
  
  
testing_lasso_pred %>% 
  ggplot(aes(x = PROFIT, 
             y = .pred,
             color = YEAR)) +
  geom_point(alpha = .5, 
             size = .5) +
  geom_smooth(se = FALSE) +
  geom_abline(slope = 1, 
              intercept = 0, 
              color = "darkred") +
  geom_text(aes(label = Name),label.size = 0.15,data = testing_lasso_pred %>% filter(PROFIT>300) )+
  labs(x = "Actual Return", 
       y = "Predicted Return") + 
  scale_color_viridis_b()
```

### Interpretable Machine Learning 

Again, the boxplot and the histogram of residuals, the model does not predict the return well and this may because some extreme values. 
```{r}
lasso_explain <- 
  explain_tidymodels(
    model = return_lasso_final_mod,
    data = data_training %>% select(-PROFIT), 
    y = data_training %>%  pull(PROFIT),
    label = "lasso"
  )
```

```{r}
lasso_mod_perf <- model_performance(lasso_explain)
hist_plot <- 
  plot(lasso_mod_perf,
       geom = "histogram")
box_plot <-
  plot(lasso_mod_perf,
       geom = "boxplot")

hist_plot
box_plot

```

Below shows the feature importance plots generated by two different methods. We could see that macroeconomic indicators remained importance in both of the plots but market cap seems to be an crucial factor when we use permutation method.

```{r}
set.seed(10) #since we are sampling & permuting, we set a seed so we can replicate the results
lasso_var_imp <- 
  model_parts(
    lasso_explain
    )

plot(lasso_var_imp, show_boxplots = TRUE)
```
```{r}
return_lasso_final_mod %>% 
  pull_workflow_fit() %>% 
  vip()
```

## Random Forest Model 

### Building Random Forest model 

To build the random forest model, we set up recipe using our training data (`data_training`), define model with `mtry = 6`, `min_n = 10` and `numbers of tree = 200`, create ranger workflow, and then we can fit the model. 

```{r}
# set up recipe 
ranger_recipe <-
  recipe(PROFIT ~ ., #short-cut, . = all other vars
                       data = data_training) %>%
  step_filter(YEAR<2021) %>% 
  # remove the unwanted variables
  step_rm(YEAR,Name,GDP,M1SL,Sector) %>% 
  # add PE 
  step_mutate(PE = `MARKET CAP`/EARNINGS)

#define model
ranger_spec <- 
  rand_forest(mtry = 6, 
              min_n = 10, 
              trees = 200) %>% 
  set_mode("regression") %>% 
  set_engine("ranger")

#create workflow
ranger_workflow <- 
  workflow() %>% 
  add_recipe(ranger_recipe) %>% 
  add_model(ranger_spec)

#fit the model
set.seed(712) # for reproducibility - random sampling in random forest choosing number of variables
ranger_fit <- ranger_workflow %>% 
  fit(data_training)
```

### Random Forest results

The table below shows the OOB error (MSE), OOB RMSE (Root mean square error), and R squared of the stock return predictions using the random forest model above.

OOB error: 
```{r}
ranger_fit$fit$fit$fit$prediction.error
```

OOB RMSE: 
```{r}
sqrt(ranger_fit$fit$fit$fit$prediction.error)
```

R Squared: 
```{r}
ranger_fit$fit$fit$fit$r.squared
```

### Model Evaluation
#### Prediciton precision (Traning data)

Even though the mean rmse from the random forest (39.30787) seems to be fairly high but it is still lower than the LASSO model of 46. Thus, random forest model performs better. 

```{r}
set.seed(1211) # for reproducibility
data_cv <- vfold_cv(data_training, v = 5)

metric <- metric_set(rmse)
ctrl_res <- control_stack_resamples()

ranger_cv <- ranger_workflow %>% 
  fit_resamples(data_cv, 
                metrics = metric,
                control = ctrl_res)

# Evaluation metrics averaged over all folds:
collect_metrics(ranger_cv)
```

We also plot a graph showing the actual return vs predicted return. 

```{r}
ranger_prediction <- predict(
  ranger_fit,
  new_data = data_training)

ranger_training_pred<-data_training %>% 
  mutate(.pred = ranger_prediction$.pred)
  
  
ranger_training_pred %>%
  ggplot(aes(x = PROFIT,
             y = .pred)) +
  geom_point(alpha = .5,
             size = .5) +
  geom_smooth(se = FALSE) +
  geom_abline(slope = 1,
              intercept = 0,
              color = "darkred") +
  labs(x = "Actual Return",
       y = "Predicted Return")

```

#### Prediciton precision (Testing data)

The table below shows the rmse on testing data. It is a bit higher but fairly similar to the error on training data so we could say that our model did not overfit. 

```{r}
set.seed(1211) # for reproducibility
data_cv <- vfold_cv(data_testing, v = 5)

metric <- metric_set(rmse)
ctrl_res <- control_stack_resamples()

ranger_cv <- ranger_workflow %>% 
  fit_resamples(data_cv, 
                metrics = metric,
                control = ctrl_res)

# Evaluation metrics averaged over all folds:
collect_metrics(ranger_cv)
```

Below is the graph showing the actual return vs. predicted return on testing data. 

```{r}
ranger_test_prediction <- predict(
  ranger_fit,
  new_data = data_testing)

ranger_test_pred<-data_testing %>% 
  mutate(.pred = ranger_test_prediction$.pred)
  
  
ranger_test_pred %>%
  ggplot(aes(x = PROFIT,
             y = .pred)) +
  geom_point(alpha = .5,
             size = .5) +
  geom_smooth(se = FALSE) +
  geom_abline(slope = 1,
              intercept = 0,
              color = "darkred") +
  labs(x = "Actual Return",
       y = "Predicted Return")

```

### Interpretable Machine Learning 

Based on the box-plot and the histogram below, the residuals mostly lie between -50 to 50. But there are some a few outliners that can go up to 400. 

```{r}
rf_explain <- 
  explain_tidymodels(
    model = ranger_fit,
    data = data_training %>% select(-PROFIT), 
    y = data_training %>%  pull(PROFIT),
    label = "rf"
  )
rf_mod_perf <-  model_performance(rf_explain)

hist_plot <- 
  plot(rf_mod_perf, 
       geom = "histogram")
box_plot <-
  plot(rf_mod_perf, 
       geom = "boxplot")

hist_plot
box_plot
```

According to the feature importance bar chart below, we can see that the top three important features are market cap, earnings, M1SL_PC1, and GDP_PC1. 

```{r}
set.seed(10) #since we are sampling & permuting, we set a seed so we can replicate the results
rf_var_imp <- 
  model_parts(
    rf_explain
    )
plot(rf_var_imp, show_boxplots = TRUE)
```

After creating two models, we then move on to the final method: stacking. For the stacking model, we also add one more model: KNN to improve the accuracy rate of the model.

## Stacking Model

### Random Forest Candidate 

```{r}
ranger_recipe <- 
  recipe(formula = PROFIT ~ ., 
         data = data_training) %>% 
  # Make these evaluative variables, not included in modeling
  update_role(all_of(c("YEAR", "Name", "Sector")),
              new_role = "evaluative")

ranger_spec <- 
  rand_forest(mtry = 6, 
              min_n = 10, 
              trees = 200) %>% 
  set_mode("regression") %>% 
  set_engine("ranger")

ranger_workflow <- 
  workflow() %>% 
  add_recipe(ranger_recipe) %>% 
  add_model(ranger_spec) 

ranger_fit <- ranger_workflow %>% 
  fit(data_training)

set.seed(1211) # for reproducibility
final_data_cv <- vfold_cv(data_training, v = 5)

metric <- metric_set(rmse)
ctrl_res <- control_stack_resamples()

ranger_cv <- ranger_workflow %>% 
  fit_resamples(final_data_cv, 
                metrics = metric,
                control = ctrl_res)
```

### LASSO Candiate

```{r}
# lasso recipe and transformation steps
lasso_final_data_recipe <- recipe(PROFIT ~ ., 
                       data = data_training) %>% 
  #step_rm(Name, Sector, YEAR, COMPANY) %>%
  update_role(all_of(c("Name",
                       "Sector",
                       "YEAR")),
              new_role = "evaluative") %>% 
  step_dummy(all_nominal(), 
             -all_outcomes(), 
             -has_role(match = "evaluative")) %>% 
  step_normalize(all_predictors(), 
                 -all_nominal())

#define lasso model
lasso_mod <- 
  linear_reg(mixture = 1) %>% 
  set_engine("glmnet") %>% 
  set_args(penalty = tune()) %>% 
  set_mode("regression")

# create workflow
lasso_wf <- 
  workflow() %>% 
  add_recipe(lasso_final_data_recipe) %>% 
  add_model(lasso_mod)

# penalty grid - changed to 10 levels
penalty_grid <- grid_regular(penalty(),
                             levels = 10)

# add ctrl_grid - assures predictions and workflows are saved
ctrl_grid <- control_stack_grid()

# tune the model using the same cv samples as random forest

lasso_tune <- 
  lasso_wf %>% 
  tune_grid(
    resamples = final_data_cv,
    grid = penalty_grid,
    metrics = metric,
    control = ctrl_grid
    )
```

### KNN Candidate

```{r}
# create a model definition
knn_mod <-
  nearest_neighbor(
    neighbors = tune("k")
  ) %>%
  set_engine("kknn") %>% 
  set_mode("regression")

# create the workflow
knn_wf <- 
  workflow() %>% 
  add_model(knn_mod) %>%
  add_recipe(lasso_final_data_recipe)

# tune it using 4 tuning parameters
knn_tune <- 
  knn_wf %>% 
  tune_grid(
    final_data_cv,
    metrics = metric,
    grid = 4,
    control = ctrl_grid
  )
```


### Stacking all candidates
By showing and visualize the members in our stacking model, we could see that random forest model contribute the most. 

```{r}
final_data_stack <- 
  stacks() %>% 
  add_candidates(ranger_cv) %>% 
  add_candidates(lasso_tune) %>% 
  add_candidates(knn_tune)

final_data_blend <- 
  final_data_stack %>% 
  blend_predictions()

final_data_blend
```

```{r}
autoplot(final_data_blend)
```

```{r}
final_data_blend$metrics %>% 
  filter(.metric == "rmse") %>% 
  summarise(mean_rmse = mean(mean))
```



```{r}
final_data_final_stack <- final_data_blend %>% 
  fit_members()
```

Table below shows the top10 prediction when we applied the model to testing data. Here, we can see that when the model predicts company with high rate of return, the actual return is high as well. 
```{r}
final_data_final_stack %>% 
  predict(new_data = data_testing) %>% 
  bind_cols(data_testing) %>% 
  select(Name, .pred, PROFIT) %>% 
  arrange(desc(.pred)) %>% 
  head(10)
```



### Comparison of the three models

With the three models, we then move on to see which model performs the best: 

#### Lasso model
```{r}
data_frame(return_lm_tune %>%
  select(id, .metrics) %>%
  unnest(.metrics) %>%
  filter(.metric == "rsq") %>% 
  filter(.config == "Preprocessor1_Model08") %>% 
  summarise(mean_rsq = mean(.estimate)),

return_lm_tune %>%
  select(id, .metrics) %>%
  unnest(.metrics) %>%
  filter(.metric == "rmse") %>% 
  filter(.config == "Preprocessor1_Model08") %>% 
  summarise(mean_rmse = mean(.estimate)))
```

#### Random Forest Model

```{r}
#OOB RMSE
data.frame(mean_rmse = sqrt(ranger_fit$fit$fit$fit$prediction.error),
# R squared
mean_rsq = ranger_fit$fit$fit$fit$r.squared)
```


#### Stacking Model 

```{r}
# Stacking model: 
data.frame(final_data_blend$metrics %>% 
  filter(.metric == "rmse") %>% 
  summarise(mean_rmse = mean(mean)),

final_data_blend$metrics %>% 
  filter(.metric == "rsq") %>% 
  summarise(mean_rsq = mean(mean)))
```
Here, we can see that compared to the three models, even though Random Forest performs better than the stacking model, stacking model use features in the Random Forest along with additional features from KNN and lasso. With that reason, we will choose stacking model as our model choice. 

## Stock Return Prediction for 2021

After picking our model, we then move on to use the model to predict the potential profit for 2021: 

```{r}
set.seed(456)
ytd = c(80.30, 39.77, -8.26, 11.29, 22.08, 132.51, 137.04, -4.22, 39.77, 20.76, 26.41, 26.34, 41.50, 178.43, 18.70, 87.29, -6.16, 52.58, 8.70, -13.70)
set.seed(456)
pred_2021<-final_data_final_stack %>% 
  predict(new_data = final_data_2021) %>% 
  bind_cols(final_data_2021) %>% 
  select(Name, .pred) %>% 
  arrange(desc(.pred)) %>% 
  head(20) %>% 
  mutate(actual_ytd = ytd) 
pred_2021 %>% 
  kbl() %>% 
  kable_styling(bootstrap_options = c("striped", "bordered", "hover", "condensed")) %>% 
  column_spec(c(1,3), color = ifelse(pred_2021$actual_ytd < 0, "red", "darkgreen")) %>% 
  scroll_box(width = "100%", height = "500px") 
```

## Conclusion 

The weakness of our model is that the average error of prediction is still very high so if one wants to predict the exact return, our model won't be ideal. But the strength of our model is that for stocks with extremely high returns, even though the prediction might not be that precise, highly likely, our model will predict positive returns. That means, in real life, if we choose the top stocks to invest in based on our prediction, it is less likely we are going to lose money. Another strength of using a model to help with investment is that it excludes our subjective feelings. 

To make the model better, we could do more research and add more regressors. For example, some financial indicators that are important for value investing are not reflected in our model due to the lack of data.  Examples of those variables include Price to Sales, Price to Cash Flow, and Price to Book. Also, since we are concerned about the long-term return and all fundamental factors usually take longer to affect the firms, it's probably helpful to do return in 2 years or 3 years or include lag of some variables in our model.  


